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I.  INTRODUCTION 

Previous  parallel  approaches  to  the  solution  of  optimal  control  problems 
[LAR73],  [MEY73],  [SCH81],  [TRA80]  have  been  devised  without  explicitly  taking 
into  consideration  the  computational  environment.  In  particular,  when  the 
number  of  available  processors  is  small  in  relation  to  the  problem  size,  the  above 
techniques  simply  fold  the  computations  to  fit  the  number  of  processors.  More 
efficient  parallel  algorithms  may  be  devised  by  considering  the  computational 
environment  throughout  the  algorithm  synthesis.  Toward  that  end,  we  present 
in  this  paper  a  parallel  procedure  for  gradient  evaluation  which  is  formulated  as  a 
function  of  the  number  of  available  processors.  Although  presented  in  the  con¬ 
text  of  unconstrained  optimal  control,  our  results  for  gradient  computation  are 
also  applicable  to  constrained  problems.  Furthermore,  we  show  that  the  step  size 
obtained  as  a  result  of  the  line  search  performed  at  each  iteration  may  also  be 
efficiently  computed  in  parallel.  We  then  combine  the  techniques  for  parallel  gra¬ 
dient  evaluation  and  step  size  determination  to  produce  parallel  implementations 
of  the  best-step  steepest  descent  method  and  the  Fletcher-Reeves  conjugate  gra¬ 
dient  method  to  solve  the  linear  quadratic  regulator  (LQR)  control  problem 
[LEW86], 

It  is  well  known  that  a  closed  loop  feedback  form  solution  exists  for  the  LQR 
problem.  Our  motivation  for  solving  that  problem  using  iterative  gradient  based 
techniques  is  that  our  basic  parallel  approach  can  be  applied  to  more  complex 
control  problems  in  which  the  system  dynamics  can  be  linearized  and  the  cost 
approximated  quadratically.  Furthermore,  efficient  parallel  implementations  of 
gradient  methods  such  as  best-step  steepest  descent  and  conjugate  gradient  sug¬ 
gests  that  similar  parallel  implementations  of  penalty  function  and  gradient  pro¬ 
jection  methods  may  be  used  to  solve  constrained  control  problems. 


,*  v  v  v\‘  *,*  W  v\* 1  »**  •’*  A  • 
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Our  approach  to  the  parallel  evaluation  of  the  step  and  gradient  reduces  the 
total  number  of  operations  required  by  sharing  common  terms  when  possible  and 
then  introduces  parallelism.  The  degree  of  parallelism  exhibited  by  the  step  and 
gradient  computation  techniques  presented  in  this  paper  varies  as  a  function  of 
the  number  of  processors  to  be  used.  We  constrain  the  number  of  available  pro¬ 
cessors,  p  ,  to  lie  in  the  range  1  <  p  <  uN* ,  where  n  is  the  size  of  the  system 
state  vector,  N  is  the  number  of  stages  in  the  control  process  and  we  assume 
n  >  m  ,  where  m  is  the  size  of  the  control.  One  of  the  features  of  the  proposed 
parallel  iterative  algorithms  is  that  their  structure  is  completely  specified  by  the 

number  of  processors  whenever  the  number  of  stages  N  > 


An  efficient  technique  for  gradient  evaluation  using  a  single  processor  has 
been  discussed  by  Polak  [POL71,  pp. 66-69].  A  direct  implementation  of  this 
technique  on  p  processors  achieves  linear  speedup  for  p  up  to  n  ;  however,  for 
p  >  n  ,  the  speedup  is  significantly  reduced.  In  this  paper,  we  present  an 
approach  to  gradient  computation  that  may  be  efficiently  implemented  on  up  to 
nN*  processors  and  reduces  to  a  direct  parallel  implementation  of  the  gradient 
evaluation  given  in  [POL71]  when  1  <  p  <  n  .  However,  for  n  <  p  <  njV* , 


we  show  that  our  approach  achieves  speedup  greater  than  — (p  +  n  ). 

2 


A  critical  step  in  our  approach  involves  the  parallel  computation  of  the  state 
and  costate  vectors.  When  «  =  1,  the  computations  reduce  to  solving  a  forward 
linear  recurrence  system  followed  by  a  reverse  linear  recurrence  system,  both  of 
size  N .  The  parallel  evaluation  of  m-th  order  linear  recurrence  systems  has  been 
extensively  studied  [KOG73],  [CHE75],  [KUC76],  [SAM75],  [SAM77],  [CHE78], 
[GAJ81],  [CAR84],  [MEY85]  and  [MEY86].  To  solve  first-order  linear  block 
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recurrence  systems  in  parallel,  we  use  a  blocked  formulation  of  the  approach 
presented  in  [MEY86],  [MEY86].  In  addition  to  requiring  less  steps  to  solve  a 
first-order  system  than  any  of  the  above  when  1  <  p  <  nN*  ,  the  approach  in 
[MEY86]  can  be  implemented  on  a  simple  ring  network  with  broadcasting  capa¬ 
bility. 

The  organization  of  this  paper  is  as  follows  :  in  Section  n,  we  state  the 
unconstrained  discrete  linear  quadratic  optimal  control  problem,  examine  the  gra¬ 
dient  of  the  cost  function  and  give  the  steepest  descent  algorithm  we  shall  con¬ 
sider.  Section  III  presents  the  step  length  and  gradient  computations  required  at 
each  iteration.  In  Section  IV  we  give  parallel  procedures  to  solve  the  linear 
recurrence  systems  required  by  Section  III.  Based  upon  the  results  of  Sections  III 
and  IV,  Section  V  presents  parallel  implementations  of  the  best-step  steepest  des¬ 
cent  method  to  solve  the  LQR  problem  and  the  corresponding  performance 
analysis.  Finally,  in  Section  VI  conclusions  are  presented. 

H.  PRELIMINARIES 

We  consider  the  LQR  discrete  optimal  control  problem: 

Problem  1:  Given  an  m  -input,  discrete,  time-varying  linear  system  in  which  we 
are  given  the  initial  state,  z0,  and 

=  A,  2.  .J  +  fl,  u, ,  t  =  1,2,...V,  (1) 

where  for  «  =  0,l,...,iV ,  2,  in  En  is  the  state  of  the  system  at  time  i  and  for  i 
=  1,2 ,...,V,  u,  in  Em  is  the  control  at  time  t  ,  find  the  mV  control  vector 
u  =  (u  j  ,  til;  ,  .  .  .  ,  u^)1  fchat  minimizes  the  performance  index 

J(u)  =  ~  (  2*  Qz  -I-  u(  Ru  ), 

where  z  is  the  nV  vector  2  —  (z[  ,  z 2 . z^)1 ,  Q  =  Dtag  (Q  lt 


-r. 


Qn  )  is  the  nN  XnN  block  matrix  that  consists  of  N  n  Xn  symmetric  positive 
semi-definite  blocks  <?,-  and  R  =  Diag  ( R  1(  R  2,...,  RN )  is  the  mN  XmN  block 
matrix  that  consists  of  N  m  Xm  symmetric  positive  definite  blocks  R, . 


The  hypotheses  on  the  matrices  Q  and  R  insure  that  Problem  1  possesses  a 

=  0 

U  =  V 

We  now  introduce  a  formulation  for  —  that  is  used  in  our  parallel  imple- 

du 

mentation  of  gradient  algorithms. 

A  direct  application  of  the  chain  rule  for  differentiation  [GRA81]  yields 

dJ  _  dJ  |  dJ  dz 
du  du  dz  du  ’ 


unique  solution  u  *  [LUE84]  and  that 


where  the  2,  ’s  are  determined  in  accordance  with  Eq.  (l) 


dJ  ,  dJ 
,  ——  and  — —  are  the 
du  dz 


1  XmN  and  lXniV  Jacobian  matrices  u  1  R  and  zl  Q  respectively,  and  — —  is 

du 

the  nN  XmN  block  lower  triangular  Jacobian  matrix  that  consists  of  N2  n  Xm 


blocks 


dzi 

duj 


obtained  by  the  chain  rule  for  all  i  and  j  in  [l,2,...,iV 


by 


0  if  t  <  j 

Bj  if  t  =  j  (3 

Ai- 1  '  '  ’  A  j  +  iBj  if  i  >  j. 

Eq.  (3)  shows  that  the  influence  matrix  —■  satisfies  Fz  =  Fu  ,  where  F . 

du  du 

is  the  nN  XnN  block  lower  bidiagonal  matrix  that  consists  of  N~  n  Xn  blocks 
^ Fz  j  defined  for  all  t  and  j  in  [l,2,...,iV]  by 
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/  if  t  =  j 

(f>  )„  =  { -A<  if*=>+i 

0  otherwise 

and  Fu  is  the  nN  XmN  block  diagonal  matrix  that  consists  of  TV2  n  Xm  blocks 
|  defined  for  all  i  and  j  in  [l,2,...,7V  j  by 


(f‘  ),  =  {  o' 


if  i  =  j 

otherwise. 


Let  g  —  ——  be  the  the  mN  XI  gradient  of  J  (u  )  with  respect  to  u  .  The 
du 

matrix  Ft  is  non-singular  and  thus  we  may  rewrite  Eq.  (2)  as 


g  =  Ru  +  F‘ Fz  lQz  , 


where  z  again  satisfies  Eq.  (1). 


Let  F0  be  the  nN  Xn  block  matrix  that  consists  of  N  n  Xn  blocks  j  . 


defined  for  all  i  in  [1,2,.. .,7V]  by 


if,=1 

^  0  J ,  ^  0  otherwise  , 


and  let  X  be  the  nN  costate  vector  X  =  (X/,  X2<,  .  .  .  ,  ,  X,  in  defined 


X  =  F~'Qz. 

Then,  given  u  and  z0,  the  gradient  g  may  be  obtained  by  using  the  three  equa- 


Fz  =  Fu  u  +  F0z0, 


F!\  =  Qz  , 


y- v* 


■  a  •**  a  C.  «  *  .1 


With  the  notation  g 


d£ 

du 


t 


k  _ 


the  version  of  the  best-step  steepest 


descent  method  that  we  use  is  the  following: 

Algorithm  1:  Let  u  1  be  given. 

Step  0:  Set  k  —  1  and  compute  g  l. 

Step  1:  If  Iff*  |2  S  f  stop;  else  go  to  Step  2. 

Step  2:  Compute  ak  to  minimize  J  (u  k  -  ak  g  k  ). 

Step  3:  Compute  g  k  +1. 

Step  4:  Let  uk  +l  =  uk  -  ak  gk  . 

Step  5:  Set  k  =  k  - f-  1  and  go  to  Step  1. 

Note  that  we  compute  g  1  in  Step  0  and  then  compute  g k  +1  in  Step  3, 
after  the  computation  of  ak  in  Step  2.  In  Section  III  we  present  our  approach  to 
the  computation  of  g1,  ak  ,  and  g  k  +1  and  we  show  that  the  computation  of  ak 
and  gk+l  shares  common  terms.  In  Section  IV,  we  discuss  efficient  parallel  pro¬ 
cedures  for  solving  block  linear  recurrences  and  we  use  those  procedures  in  Sec¬ 
tion  V  to  obtain  a  parallel  implementation  of  Algorithm  1. 


m.  GRADIENT  AND  BEST  STEP  COMPUTATION 


We  first  consider  the  computation  of  g  \  which  is  performed  only  once  in 
Step  0  of  Algorithm  1.  As  a  consequence  of  Eqs.  (4),  (5)  and  (6)  we  obtain  the 
gradient  evaluation  technique  proposed  by  Polak  [POL7IJ: 

Algorithm  2:  Given  u  1  and  z0. 


Step  1:  Compute  z  1  such  that  Fz  z  1  =  Fu  u  1  +  F  0z0. 


I 

Step  2:  Compute  X1  such  that  F/X1  =  Qzl. 

Step  3:  Compute  g  1  =  Ru  1  +  F^X1. 

Due  to  the  lower  n  Xn  block  bidiagonal  structure  of  Fz  ,  Steps  1  and  2 
of  Algorithm  2  require  procedures  for  the  solution  of  N  stage  forward  and  reverse 
n  Xn  block  first-order  linear  recurrences,  respectively.  Such  procedures  are 
presented  in  next  section.  Given  X1,  Step  3  computes  g  1  by  computing  each  of 

t 

;  thus,  Step  3  exhibits  linear 

u  =  u 1 

speedup  when  executed  in  parallel. 

We  now  consider  the  computation  of  the  optimal  step  length  ak  .  The 
cost  function  is  quadratic  and  therefore  a  closed  form  solution  for  ak  exists.  It  is 
clear  that 


the  N  uncoupled  components  g, 1  =  j  -~- 


J(u)  =  ±u'(R  +  FtF'-'QFr'FJu, 


therefore 


where 


J(uk  -agk)=aa2+ba+c, 


a  =  ~<gk  ,dk  >, 


b  =  -<gk  ,gk  >, 


c  =/(«*), 

dk  =(R  +  F‘Fz-tQFz-lFu)gk  ,  (7) 


and  it  follows  that  the  optimal  step  length  a*  is 


Once  d  k  is  known,  the  gradient  g  *  +1  is  easy  to  evaluate  using 

g  k  +1  =  g  k  -  ak  dk  . 

The  matrix  ( R  +  FxF~tQFz~xFu  )  may  be  precomputed  and  the  quan¬ 
tity  d k  may  be  obtained  by  performing  a  single  matrix-vector  product.  How¬ 
ever,  we  show  in  Section  V  that  a  more  efficient  approach  can  be  obtained  by 
rewriting  dk  as 

dk  =Rgk  +F*F,-tQF,-lFMgk, 

and  using  Algorithm  3  below,  where  we  note  that  instead  of  requiring  Fz~x  and 
Fz~*,  we  solve  linear  systems  corresponding  to  Fz  and  Fz. 

Algorithm  S:  Given  g  k  . 

Step  1:  Compute  fik  —  Rgk  . 

Step  2:  Compute  6k  —  Fu  gk  . 

Step  3:  Compute  w  *  such  that  Fzwk  =  6k  . 

Step  4:  Compute  rjk  =  Qw  k  . 

St^p  5:  Compute  vk  such  that  Fzvk  =  r\k  . 

Step  6:  Compute  x*  =  F*vk. 

Step  7:  Compute  dk  =  nk  +  x*  . 

With  the  exception  of  Steps  3  and  5,  Algorithm  3  computes  dk  by  exe¬ 
cuting  a  series  of  matrix-vector  products  followed  by  a  vector  sum.  Each  of  the 
matrix-vector  products  consists  of  N  uncoupled  block  matrix-vector  products, 
thus  exhibiting  linear  speedup  when  implemented  in  parallel.  However,  due  to 
the  structure  of  Fz ,  Steps  3  and  5  require  the  solution  of  .V  stage  forward  and 
reverse  n  Xn  block  first-order  linear  recurrences,  respectively.  As  in  the  case  of 
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Algorithm  2,  this  again  suggests  the  need  for  parallel  procedures  to  solve  linear 
recurrences.  Note  that  the  state  zk  +1  and  costate  X*  +1  corresponding  to  u  k  +1 
can  be  obtained  easily  from  the  quantities  w  k  and  v k  computed  in  Steps  3  and 
5  of  Algorithm  3,  that  is, 

2*+1==2*  -  ak  wk 


\k+l  =  X*  -  ak  v‘ 


TV.  PARALLEL  PROCEDURES  FOR  LINEAR  BLOCK  RECURRENCES 

The  model  of  S1MD  parallel  computation  that  we  use  consists  of  a  global 
parallel  memory,  p  parallel  processors,  and  a  control  unit,  where  all  processors 
perform  the  same  operation  at  each  time  step  (see  Fig.  1).  We  further  simplify 
the  model  by  making  the  following  assumptions: 

Al.  Each  computational  operation  takes  the  same  amount  of  time,  referred  to 
as  a  step. 

A2.  There  are  no  accessing  conflicts  in  global  memory. 

A3.  All  initial  data  resides  in  global  memory. 

A4.  There  is  no  time  required  to  access  global  memory. 

We  now  present  two  parallel  procedures  to  solve  forward  and  reverse  N 
stage  n  Xn  block  first-order  linear  recurrence  systems  that  we  use  to  implement 
Algorithms  2  and  3.  The  procedures  are  blocked  versions  of  the  parallel  scalar 
approach  given  in  [MEY86]  and  are  formulated  as  a  function  of  the  number  p  of 
processors  so  that  their  structure  is  fixed  whenever  the  number  of  stages 


N  >  — 

n 


m  *  .  m  •  •  •  »  •  /  • 

s'  •.*  «,•  -.  -.  •.  ■. 


\  -•  a  /.  .n  ;• 


-V  ' 


We  first  consider  the  parallel  solution  of  forward  recurrences.  The  for¬ 
ward  recurrence  problem  is:  given  n  Xn  matrices  A, ,  t  =  2,  3,...,  N  and  given 
vectors  7,-  e  EH  ,  i  =  1,2 ,  find  the  n  vectors  zi  such  that  z  \  =  l\  and  zi 
=  Aj  z,_  1  -I-  7,- ,  *  =2,  3,...,  N .  Let 


n  = 


N-l 

(P  /«  )2-!) 
JV  -  1 


if  p  >  n 
otherwise 


(8) 


/  iM  —  {«*  +  Max  (uj+1,uj(k2-1))  :  i  —  /c,k-1,...,1}. 

Thus,  given  7  =  (7/, 72*,  ■  ■  ■  >  InY  1  li  c  En  and  precomputed  A  [»'  ]  = 

A,'+;-A,-+y_  1  ■  ■  ■  Ay,  j  e  f  0(u;),  »  in  [0,l,...,/c-l],  the  following  procedure  solves 
the  forward  block  recurrence  system,  where  for  presentation  simplicity,  we 
assume  that  and  k  are  integers. 

1.  PROCEDURE  FORWARD(Ar,n  ,p  ,7) 

2.  2  j  :=  7i 

3.  FOR  w:=OTOfi-lDO 

4.  FORALL  j  t  /  0(a;)  DO  IN  PARALLEL  2;  :=  7y  ; 

5.  FOR  »  :=  1  TO  Max  (1,/c-l)  DO 

6.  FORALL  j  e  f  L(u;)  DO  IN  PARALLEL 
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2i  +;  •  -^i  +j  zi+y-i  "F  Tfi+y» 

8.  END  FORALL 

9.  END  FOR 

10.  FORALL  j  t  f  0(u;)  DO 

11.  FOR  i  :=  0  TO  /c-1  DO  IN  PARALLEL 

12.  zi+i  A  [*  "h J  »J  \ Z)  -1  "I"  2i'+;  * 

13.  END  FOR 

14.  END  FORALL 

15.  END  FOR 

16.  END  PROCEDURE 

When  1  <  p  <  n  ,  the  index  set  /  0(u>)  is  empty  and  procedure  FOR¬ 
WARD  reduces  to  sequentially  executing  step  7  JV-1  times,  each  execution  using 
p  processors.  When  n  <  p  <  nN *  ,  procedure  FORWARD  sequentially  solves 
0  reduced  block  recurrence  systems  in  z{  of  size  (p  /n  )2,  each  in  parallel.  Each 
reduced  system  is  solved  in  two  phases:  the  first  phase  consists  of  the  execution  of 
steps  4  and  5  and  computes  (p  /n  )2  partial  solutions,  in  which  the  first  p  /n  are 
the  actual  solutions  and  the  second  phase  consists  of  the  execution  of  loop  10  in 
which  the  precomputed  n  Xn  block  matrices  A  [»  +j  ,j  ]  are  used  to  update  the 
next  p  /n  partial  solutions  at  each  iteration.  We  assign  n  processors  to  perform 
each  of  the  p  jn  concurrent  executions  of  steps  7  and  12.  The  complete  solution 
to  the  block  recurrence  system  is  obtained  after  executing  0  iterations  of  loop  3. 

If  fi  is  not  an  integer,  then  we  replace  H  by  Tn  ]  and  simply  terminate  the  com¬ 
putation  when  zN  is  computed  and  if  k  is  not  an  integer,  we  replace  k  by 

Lp/n  J- 

We  now  modify  the  procedure  FORWARD  to  solve  N  stage  n  Xn  block 
reverse  linear  recurrence  systems,  where  the  reverse  recurrence  problem  is:  given 
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n  Xn  matrices  A, ,  i  =  N -1,N and  given  vectors  e  En  ,  i  = 
N,iV-l,...,l,  find  the  n  vectors  X,-  such  that  \N  =  Cat  and  X,  =  A,  X,  +1  +  , 

i  =  N -1,N Let  n  and  k  be  defined  by  Eqs.  (8)  and  (9).  For  w  in 
[0,l,...,fJ-l],  define  the  index  sets 

r0M  =  {/ci  +  w(/c2-l)  :  i  =  /c-l,/c-2,...,l} 


r  i(w)  =  {/ci  +  Max  (cj+1,cj{k2-1))  :  i  =  /c,/c— } . 

Given  c  =  .  .  .  ,  ,  ?,•  £  En  and  precomputed  A  [j  +l,j-i  +l](  = 

Aj_i+lAj_i  ■  ■  ■  Aj+y  ,  j  e  r0(ui),  i  in  [0,1,...,k-1],  the  following  procedure 
solves  the  reverse  block  recurrence  system,  where  for  presentation  simplicity,  we 
again  assume  that  Q  and  k  are  integers. 

1.  PROCEDURE  REVERSE(JV  ,n  ,p  ,?) 

2.  <;N 

3.  FOR  u:=n-lTOODO 

4.  FORALL  j  e  r0(w)  DO  IN  PARALLEL  Xy  :=  ; 

5.  FOR  i  :=  1  TO  Max  (1,/c-l)  DO 

6.  FORALL  j  e  r  DO  IN  PARALLEL 

7.  \j-i  •—  Ay_,'+j  Xy_(+j  +  |  ; 

8.  END  FORALL 

9.  END  FOR 


10.  FORALL  j  £  r0(u>)  DO 

11.  FOR  I  :=  0  TO  k-1  DO  IN  PARALLEL 

12.  Xj  _,  A  [j +l,j-i +1}* \J  +I  +  \j_t- ; 

13.  END  FOR 


14.  END  FORALL 


-  15- 


15.  END  FOR 


16.  END  PROCEDURE 


We  now  give  the  number  of  steps  required  to  solve  either  an  TV  stage  for¬ 


ward  or  reverse  first-order  n  X  n  block  linear  recurrence  system. 


Theorem  1:  Given  TV,  n  and  p  such  that  p  =1  or  —  is  an  integer,  the  number 

n 


of  parallel  steps  required  to  solve  a  block  n  X  n  first-order  linear  recurrence  sys¬ 


tem  of  length  TV  using  p  processors  is 


(TV  -1)- 


if  1  <  p  <  n 


T(N,n,p)  = 


TV  -  1 


4 (p  -  n  )  if  n  <  p  <  nTV14  . 


*  -1) 
n 


It  is  clear  from  Theorem  1  that  the  speedup  S .  =  — —  exhibited  by  the 


procedures  FORWARD  and  REVERSE  is  p  when  1  <  p  <  n  and  — (p  +  n  ) 

2 


when  n  <p  <  nTV*4  and  —  and 

n 


are  integers.  The  corresponding 


-M-i 


efficiency  E.  =  - is  therefore  1  when  1  <  P  <  n  and  —  -I - when 


2  2  p 


n  <p  <  nTV14  and  —  and  — ^—4 —  are  integers.  In  Figs.  2,  3,  4  and  5  we  plot 

M  £  \  * 


-1 


Ep  for  the  values  of  n  =8,  16,  32  and  64  respectively,  where  the  efficiency 


corresponding  to  the  procedures  FORWARD  and  REVERSE  is  denoted  by  the 


dashed  line  in  each  plot.  Thus,  we  see  that  the  efficiency  increases  with  increas- 


iwv 


ing  values  of  n  and  p  and  is  independent  of  TV . 

V.  PARALLEL  BEST  STEP  STEEPEST  DESCENT 

We  now  use  the  parallel  procedures  for  the  solution  of  linear  recurrences 
discussed  in  the  previous  section  to  obtain  parallel  implementations  of  Algorithms 
2  and  3  give  the  corresponding  number  of  steps  required  for  their  execution  when 
p  processors  are  used. 

We  first  give  the  parallel  implementation  of  Algorithm  2. 

1.  PROCEDURE  GRADIENT^  0,«  >) 

2.  FORALL  *'  e  {1,2, ...,7V  }  DO  IN  PARALLEL  V  :=  R,  u, l; 

3.  7il  :=  711  +  A  i*o* 

4.  z  1  =  FORWARD(7V  ,n  ,p  ,V); 

5.  FORALL  «  e  {1,2 }  DO  IN  PARALLEL  c,1  :=  Qx  z{  l; 

6.  X1  =  REVERSE(iV  ,n  ,p 

7.  FORALL  i  e  {1,2,.. .,7V  }  DO  IN  PARALLEL  g{  1  :=  R,  «,■ 1  + 

8.  RETURN  g1; 

9.  END  PROCEDURE 


Lemma  1:  Given  zQ,  ul,  TV ,  n  ,  m  and  p  such  that  p  =  1  or  —  is  an  integer, 

n 

the  number  of  steps  required  by  the  procedure  GRADIENT  to  compute  g  1  using 
p  processors,  1  <  p  <  nTV14 ,  is 


We  next  give  the  parallel  implementation  of  Algorithm  3. 


1.  PROCEDURE  DIRECTION^  *  ) 

2.  FORALL  »'  €  {1,2 ,...,7V  }  DO  IN  PARALLEL  n,k  :=  R,  </,*; 

3.  FORALL  »  c  {1,2,..., AT }  DO  IN  PARALLEL  6k  :=  B{  <?,*; 

4.  w*  =  FORWARD(7V  ,n  ,p  ,6*  ); 

5.  FORALL  t  e  {1,2, ...,7V}  DO  IN  PARALLEL  v,k  :=  Q,  w,*; 

6.  v*  =  REVERSE(7V,n,p,f?*); 

7.  FORALL  t  e  {1,2,.. .,7V}  DO  IN  PARALLEL  x,k  :=  B/w,-*; 

8.  FORALL  i  t  {1,2,.. .,7V  }  DO  IN  PARALLEL  df  :=  /i,*  +  x,*5 

9.  RETURN  (/*; 

10.  END  PROCEDURE 

Lemma  2:  Given  g  *  ,  TV ,  n  ,  m  and  p  such  that  p  =  1  or  —  is  an  integer,  the 

n 

number  of  steps  required  by  procedure  DIRECTION  to  compute  d  k  using  p  pro¬ 
cessors,  1  <  p  <  riTV*4 ,  is 


T  (N  ,n  ,m  ,p  )  = 


-^(8n  +2m  -2)  + 


Nm 

P 


(2n  +2m  -1)  - 


4ns 


Nn 


(2n  +2m-2)  + 


Nm 


(2 n  -t-2 m  -1)  + 


yv-i 


V* 

'  -1 
n 


if  1  <  p  <  n 
|8(p  -n  )  if  n  <  p  <  nN 


At  this  point  it  is  interesting  to  contrast  the  result  of  Lemma  2  with  the 
number  of  steps  necessary  to  obtain  d  *  based  upon  the  precomputation  of  the 
matrix  ( R  -I-  F^F^QF^F^  ).  Given  p  processors,  1  <  p  <  Nm  ,  the  number 
of  steps  required  to  obtain  d  k  using  Eq.  (7)  is 

T(7V  ,m  ,p  )  = 


Nm 

P 


(27Vm  -  1). 


Thus,  for  p  in  the  range  1  <  p  <  nTV'6 ,  the  use  of  Algorithm  3  to  compute  dk 
results  in  a  smaller  number  of  steps  than  using  the  precomputed  matrix 


8- 


(R  +  F*F;  lQF2  lFu  )  and  performing  a  matrix-vector  product. 

We  now  embed  the  parallel  procedures  GRADIENT  and  DIRECTION 
to  obtain  a  parallel  implementation  of  Algorithm  I  and  we  then  give  the 
corresponding  number  of  steps  required  for  one  iteration. 

1.  PROCEDURE  PSDM(20,u  *) 

2.  k  =  1; 

3.  g  1  =  GRADIENTS  *); 

4.  WHILE  |  <7*  |2  >  e  DO 

5.  dk  =  DIRECTION^*); 

6.  ak  =  <gk  ,gk  >/<gk  ,dk  >; 

7.  FORALL  »  t  {1,2,.. .,N  }  DO  IN  PARALLEL  g{k  +1  :=  g*  -  ak  d,*; 

8.  FORALL  t  e  (1,2 ,...,N  }  DO  IN  PARALLEL  «,*  +1  :=  u,k  -  ak  </,*; 

9.  k  =  k  +  1; 

10.  END  WHILE 

11.  END  PROCEDURE 


Theorem  2:  Given  zQ,  ul,  N ,  n  ,  m  and  p  such  that  p  =  1  or  —  is  an  integer, 

n 

the  number  of  steps  required  by  one  iteration  of  procedure  PSDM  using  p  proces¬ 
sors,  1  <  p  <  nN* ,  is 


T  (N , n  ,m  ,p  )  = 


Nn 


P 

Nn 


(8n  +2 m  -2)  +• 
(2n  +2m  -2)  + 


Nm 

P 

Nm 

P 


|(2n  +2 m  +7)  -  L!_  +  2 togrf 

P 


if  1  \  p  ^  n 


(2n  42m  +7)  + 


N- 1 
,T 
P_ 
n 


8(j>  -n  )  +  ilogji  if  n  <  p  <  nN 


The  speedup  Sp  and  efficiency  Ep  for  procedure  PSDM  can  be  obtained 
directly  from  Theorem  2  as 


\*\  •p. 


y.;-.-. 


where 


T  j  =  Nn  (6n  +  2 m  -  2)  +  Nm  (2 n  +  2m  +  7)  -  4 n  2, 


T.  =  —  {2n  +  2m  -  2)  +  — (2n  +  2m  +  7)  +  8^N  ^  +  2 log  2p  , 
Pp  P  p  +  n 


and  for  simplicity,  the  ceilings  have  been  removed.  For  the  range  of  values  of  N , 
n  and  m  of  interest 


T  i  Nn  (6n  +  2m  -  2)  +  Nm  (2n  +  2m  +  7), 


— (2n  +  2m  -  2)  +  —  (2n  +  2m  +  7)  + 
p  p  p  +  n 


and  thus 


p(C  +  4  n2) 


C  + 


8n  2p 


p  +  n 


C  +  4n2 


C  + 


8  n2p 


p  4-  n 


where 


C  =  2(n  +  m  )2  -  2n  +  7m. 


Using  the  fact  that 


— 2 —  <  1 
p  +  n 


we  may  find  lower  bounds  for  the  speedup  and  efficiency  of  procedure  PSDM, 
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r  w  «r - v  ‘  -  .»  ‘  "jr  t*  .*  w>  -  .  »  >.  ■» 


5.  > 


£.  > 


pD 


D  +  An ' 


D 


D  +  4n‘ 


where 


Note  that 


D  =  2(n  +  m)2  -  2n  +  7m  +  4 n' 


D  6  n2  +  E 


D  +  An2  lOn  2  +  E 

where 

£  —  2m2  +  4nm  -  2n  +  7m  , 
thus  E  >  0,  and  therefore 

S.  >  0.6p  , 


£p  >  0.6. 

In  Figs.  2,  3,  4  and  5,  we  plot  Ep  for  the  procedure  PSDM  for  the 
values  n=  8,  16,  32  and  64  respectively,  and  in  each  case  use  the  values  m  = 

1,  4  and  8.  It  is  then  easy  to  see  that  Ep  increases  with  m  ,  and  that  the 
efficiency  of  the  procedure  PSDM  is  bounded  from  below  by  the  efficiency  of  the 
procedures  FORWARD  and  REVERSE. 

VI.  CONCLUSIONS 

In  this  paper  a  parallel  implementation  of  the  best-step  steepest  descent 
method  has  been  presented  to  solve  the  LQR  optimal  control  problem.  The  pro¬ 
cedure  exhibits  the  desirable  property  that  its  structure,  and  hence  parallelism,  is 
determined  by  the  number  of  available  processors.  Thus,  unlike  approaches  in 


*r«vr*  S-*  v  «•»  VT»  .■v-.-T  W”  VWlni  W  \pl  W  W  3* iff)  Ifll TO  UV  W  W  W  U*  W  W  I* ! 
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which  the  structure  of  the  procedure  changes  with  problem  size,  the  procedure 
presented  in  this  paper  maintains  the  same  computational  and  interprocessor 
communication  requirements,  independently  of  the  number  of  stages  in  the  con¬ 
trol  problem.  Furthermore,  the  procedure  has  been  shown  to  exhibit  an  efficiency 
Ep  always  greater  than  0.6. 

The  paper’s  basic  approach  can  be  used  to  produce  parallel  implementa¬ 
tions  of  more  complex  gradient  based  methods.  For  example,  the  procedure 
PSDM  may  be  easily  modified  to  produce  the  following  parallel  version  of  the 
Fletcher-Reeves  conjugate  gradient  method. 

1.  PROCEDURE  PCGM(z0,u  ‘) 

2.  *  =  1; 

3.  tt1  =  <7 1  =  GRADIENT^  0,u  *); 

4.  WHILE  Iff*  |2  >  6  DO 

5.  dk  =  DIRECTION^1); 

6.  a*  =  <gk  ,nk  >/<dk  ,nk  >; 

7.  FORALL  i  f  {1,2,..., AT  }  DO  IN  PARALLEL  g,k  +1  :=  gk  -  ak  d,k; 

8.  FORALL  t  e  {1,2, ...,N  }  DO  IN  PARALLEL  «,*  +1  :=  «,*  -  a*  tt,*; 

9.  0k  =  <gk+l,gk+l>/<gk  ,gk  >; 

10.  FORALL  i  e  {1,2  ,...,JV  }  DO  IN  PARALLEL  tt,*  +1  :=  gk  +1  -  0k  tt,-*; 

11.  k  =  k  +  1; 

12.  END  WHILE 

13.  END  PROCEDURE 

The  procedure  PCGM  exhibits  slightly  less  speedup  than  the  procedure 
PSDM  due  to  the  additional  evaluation  of  the  term  /?*  .  However,  this  is  offset 
by  gaining  the  improved  convergence  properties  of  a  conjugate  gradient  method. 


'  k  •  .  •  *  ,  «  , 

.  '  .  *  kT'kJ'  \ 


V 

si 


Finally,  we  note  that  the  procedures  presented  in  this  paper  may  be 
used  to  solve  discrete  optimal  control  problems  which  involve  nonquadratic  cost, 
nonlinear  dynamics  and  constraints  on  the  states  and/or  controls.  In  that  case, 
one  needs  to  use  penalty  function  and  gradient  projection  methods  and  suitable 
approximations  to  cost  function  and  system  dynamics. 
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